Abstract. A common complaint about adaptive prefix coding is that it
is much slower than static prefix coding. Karpinski and Nekrich recently
took an important step towards resolving this: they gave an adaptive
Shannon coding algorithm that encodes each character in O(1) amortized
time and decodes it in O(log H) amortized time, where H is the
empirical entropy of the input string s. For comparison, Gagie's adaptive
Shannon coder and both Knuth's and Vitter's adaptive Huffman coders
all use O(H) amortized time for each character. In this paper we give an
adaptive Shannon coder that both encodes and decodes each character in
O(1) worst-case time. As with both previous adaptive Shannon coders,
we store s in at most (H + 1)|s| + o(|s|) bits. We also show that this
encoding length is worst-case optimal up to the lower order term.



1 Introduction 

Adaptive prefix coding is a well studied problem whose well known and widely 
used solution, adaptive Huffman coding, is nevertheless not worst-case optimal. 
Suppose we are given a string s of length m over an alphabet of size n. For static 
prefix coding, we are allowed to make two passes over s but, after the first pass, 
we must build a single prefix code, such as a Shannon code [16] or Huffman 
code [9], and use it to encode every character. Since a Huffman code minimizes 
the expected codeword length, static Huffman coding is optimal (ignoring the 
asymptotically negligible O(n log n) bits needed to write the code). For adaptive
prefix coding, we are allowed only one pass over s and must encode each character 
with a prefix code before reading the next one, but we can change the code 
after each character. Assuming we compute each code deterministically from 
the prefix of s already encoded, we can later decode s symmetrically. The most 
intuitive solution is to encode each character using a Huffman code for the prefix 
already encoded; Knuth [11] showed how to do this in time proportional to the
length of the encoding produced, taking advantage of a property of Huffman
codes discovered by Faller [3] and Gallager [7]. Shortly thereafter, Vitter [18]
gave another adaptive Huffman coder that also uses time proportional to the
encoding's length; he proved his coder stores s in fewer than m more bits than
static Huffman coding, and that this is optimal for any adaptive Huffman coder.
With a similar analysis, Milidiú, Laber and Pessoa [13] later proved Knuth's
coder uses fewer than 2m more bits than static Huffman coding. In other words,
Knuth's and Vitter's coders store s in at most (H + 2 + h)m + o(m) and
(H + 1 + h)m + o(m) bits, respectively, where
$H = \sum_a (\mathrm{occ}(a, s)/m) \log(m/\mathrm{occ}(a, s))$
is the empirical entropy of s (i.e., the entropy of the normalized distribution of
characters in s), occ(a, s) is the number of occurrences of the character a in s,
and h ∈ [0, 1) is the redundancy of a Huffman code for s; therefore, both adaptive
Huffman coders use O(H) amortized time to encode and decode each character of
s. Turpin and Moffat [17] gave an adaptive prefix coder that uses canonical codes,
and showed it achieves nearly the same compression as adaptive Huffman coding
but runs much faster in practice. Their upper bound was still O(H) amortized
time for each character but their work raised the question of asymptotically faster
adaptive prefix coding. In all of the above algorithms the encoding and decoding
times are proportional to the bit length of the encoding. This implies that we
need O(H) time to encode/decode each symbol; since entropy H depends on the
size of the alphabet, the running times grow with the alphabet size.
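
As a small illustration of the definition of H used above, the following sketch computes the empirical entropy of a string directly from its character counts. The function name `empirical_entropy` is our own and is not part of the paper.

```python
from collections import Counter
from math import log2


def empirical_entropy(s: str) -> float:
    """Empirical entropy H of s, in bits per character.

    H = sum over characters a of (occ(a, s)/m) * log2(m/occ(a, s)),
    where m = len(s) and occ(a, s) is the number of occurrences of a.
    """
    m = len(s)
    return sum((occ / m) * log2(m / occ) for occ in Counter(s).values())
```

A string over a single character has H = 0, and a string in which two characters occur equally often has H = 1.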

The above results for adaptive prefix coding are in contrast to the algorithms 
for prefix coding in the static scenario. The simplest static Huffman coders
use O(H) amortized time to encode and decode each character but, with a lookup
table storing the codewords, it is not hard to speed up encoding to take O(1)
worst-case time for each character. We can also decode an arbitrary prefix code
in O(1) time using a look-up table, but the space usage and initialization time
for such a table can be prohibitively high, up to O(m). Moffat and Turpin [14]
described a practical algorithm for decoding prefix codes in O(1) time; their
algorithm works for a special class of prefix codes, the canonical codes introduced
by Schwartz and Kallick [15].

While all adaptive coding methods described above maintain the optimal
Huffman code, Gagie [6] described an adaptive prefix coder that is based on
sub-optimal Shannon coding; his method also needs O(H) amortized time per
character for both encoding and decoding. Although the algorithm of [6]
maintains a Shannon code, which is known to be worse than the Huffman code in
the static scenario, it achieves an (H + 1)m + o(m) upper bound on the encoding
length that is better than the best known upper bounds for adaptive Huffman
algorithms. Karpinski and Nekrich [10] recently reduced the gap between static
and adaptive prefix coding by using quantized canonical coding to speed up the
adaptive Shannon coder of Gagie [6]: their coder uses O(1) amortized time to
encode each character and O(log H) amortized time to decode it; the encoding
length is also at most (H + 1)m + o(m) bits.

In this paper we describe an algorithm that both encodes and decodes each
character in O(1) worst-case time, while still using at most (H + 1)m + o(m)
bits. It can be shown that the encoding length of any adaptive prefix coding
algorithm is (H + 1)m − o(m) bits in the worst case. Thus our algorithm works in
optimal worst-case time independently of the alphabet size and achieves optimal
encoding length (up to the lower-order term). As is common, we assume n ≪ m;¹
for simplicity, we also assume s contains at least two distinct characters and m
is given in advance. Our model is a unit-cost word RAM with Ω(log m)-bit
words on which it takes O(1) time to input or output a word. Our encoding
algorithm uses only addition and bit operations; our decoding algorithm also
uses multiplication and finding the most significant bit in O(1) time. We can
also implement the decoding algorithm so that it uses AC^0 operations only.
Encoding needs O(n) words of space, and decoding needs O(n log m) words of
space. The decoding algorithm can be implemented with bit operations only at
a cost of higher space usage and additional pre-processing time. For an arbitrary
constant α > 0, we can construct in O(m^α) time a look-up table that uses
O(m^α) space; this look-up table enables us to implement multiplications with
O(1) table look-ups and bit operations.

While the algorithm of [10] uses quantized coding, i.e., coding based on the
quantized symbol probabilities, our algorithm is based on delayed probabilities:
encoding of a symbol s[i] uses a Shannon code for the prefix s[1..i − d] for
an appropriately chosen parameter d; henceforth s[i] denotes the i-th symbol
in the string s and s[i..j] denotes the substring of s that consists of symbols
s[i]s[i + 1] ... s[j]. In Section 2 we describe canonical Shannon codes and explain
how they can be used to speed up Shannon coding. In Section 3 we describe two
useful data structures that allow us to maintain the Shannon code efficiently.
We present our algorithm and analyze the number of bits it needs to encode a
string in Section 4. In Section 5 we prove a matching lower bound by extending
Vitter's lower bound from adaptive Huffman coders to all adaptive prefix coders.
In Section 6 we show that our technique can be applied to the online stable
sorting problem. In Section 7 we describe how we can use the same approach
of delayed adaptive coding to achieve O(1) worst-case encoding time for several
other coding problems in the adaptive scenario.

2 Canonical Shannon coding 

Shannon [16] defined the entropy H(P) of a probability distribution P = p_1, ..., p_n
to be $\sum_{i=1}^{n} p_i \log_2(1/p_i)$.² He then proved that, if P is over an alphabet, then we
can assign each character with probability p_i > 0 a prefix-free binary codeword
of length ⌈log_2(1/p_i)⌉, so the expected codeword length is less than H(P) + 1; we
cannot, however, assign them codewords with expected length less than H(P).³

¹ In fact, our main result is valid if n = o(m / log^{5/2} m). For the result of Section 6
and two results in Section 7 we need a somewhat stronger assumption that
n = o(√(m / log m)).

² We assume 0 log(1/0) = 0.

³ In fact, these bounds hold for any size of code alphabet; we assume throughout that
codewords are binary, and by log we always mean log_2.

Shannon's proof of his upper bound is simple: without loss of generality, assume
p_1 ≥ ... ≥ p_n > 0; for 1 ≤ i ≤ n, let $b_i = \sum_{j<i} p_j$; since |b_i − b_{i'}| ≥ p_i for
i' ≠ i, the first ⌈log(1/p_i)⌉ bits of b_i's binary representation uniquely identify
it; let these bits be the codeword for the i-th character. The codeword lengths
do not change if, before applying Shannon's construction, we replace each p_i
by $1/2^{\lceil \log(1/p_i) \rceil}$. The code then produced is canonical [15]: i.e., if a codeword
is the c-th of length r, then it is the first r bits of the binary representation of
$\sum_{\ell=1}^{r-1} W(\ell)/2^{\ell} + (c - 1)/2^{r}$, where W(ℓ) is the number of codewords of length
ℓ. For example,



 1) 000      7) 1000
 2) 001      8) 1001
 3) 0100     9) 10100
 4) 0101    10) 10101
 5) 0110        ...
 6) 0111    16) 11011



are the codewords of a canonical code. Notice the codewords are always in lexi- 
cographic order. 
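
The example above can be reproduced with a few lines of code. The sketch below is only an illustration: the helper name `canonical_codewords` is ours, and it assumes the lengths are given in non-decreasing order and satisfy the Kraft inequality. It assigns codewords by keeping a running integer that is shifted left whenever the length increases, which matches the formula for the c-th codeword of length r.

```python
def canonical_codewords(lengths):
    """Return canonical codewords (as bit strings) for sorted codeword lengths.

    Assumes `lengths` is non-decreasing and satisfies the Kraft inequality.
    """
    if any(a > b for a, b in zip(lengths, lengths[1:])):
        raise ValueError("lengths must be sorted in non-decreasing order")
    codewords = []
    code, prev = 0, lengths[0]
    for r in lengths:
        code <<= r - prev          # pad with zeros when the length grows
        codewords.append(format(code, f"0{r}b"))
        code += 1                  # next codeword of the same length
        prev = r
    return codewords


# Two codewords of length 3, six of length 4 and eight of length 5,
# as in the example in the text.
example = canonical_codewords([3] * 2 + [4] * 6 + [5] * 8)
```

Here `example[0]` is `000`, `example[8]` is `10100` and `example[15]` is `11011`, and the list is in lexicographic order.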

Static prefix coding with a Shannon code stores s in (H + 1)m + o(m) bits.
An advantage to using a canonical Shannon code is that we can easily encode
each character in O(1) worst-case time (apart from the first pass) and decode it
symmetrically in O(log log m) worst-case time (see [14]). To encode a symbol
s[i] from s, it suffices to know the pair (r, c) such that the codeword for s[i] is
the c-th codeword of length r, and the first codeword l_r of length r. Then the
codeword for s[i] can be computed in O(1) time as l_r + c. We store the pair (r, c)
for the k-th symbol in the alphabet in the k-th entry of the array C. The array
L[1..⌈log m⌉] contains the first codeword of length l for each 1 ≤ l ≤ ⌈log m⌉. Thus,
if we maintain arrays L and C we can encode a character from s in O(1) time.
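
A minimal sketch of this encoding scheme follows, under a few assumptions of our own: the tables C and L are Python dictionaries rather than arrays, the index c is counted from 0 (so the codeword is exactly L[r] + c), and ties between symbols of equal length are broken alphabetically. The function names are hypothetical.

```python
def build_tables(lengths):
    """Build C (symbol -> (r, c)) and L (length r -> first codeword of length r).

    `lengths` maps each symbol to its codeword length; the lengths are
    assumed to satisfy the Kraft inequality. c is 0-based here.
    """
    C, L = {}, {}
    code, prev = 0, None
    for sym, r in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if prev is not None:
            code <<= r - prev
        L.setdefault(r, code)
        C[sym] = (r, code - L[r])
        code += 1
        prev = r
    return C, L


def encode(text, C, L):
    """Encode text with one look-up in C and one in L per character."""
    out = []
    for ch in text:
        r, c = C[ch]
        out.append(format(L[r] + c, f"0{r}b"))
    return "".join(out)


C, L = build_tables({"a": 1, "b": 2, "c": 3, "d": 3})
encoded = encode("abcd", C, L)
```

With these lengths the codewords are 0, 10, 110 and 111, so `encoded` is `010110111`.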

For decoding, we also need a data structure D of size log m and a matrix
M of size n × ⌈log m⌉. For each l such that there is at least one codeword of
length l, the data structure D contains the first codeword of length l padded
with ⌈log m⌉ − l 0's. For an integer q, D can find the predecessor of q in D,
pred(q, D) = max{x ∈ D | x ≤ q}. The entry M[r, c] of the matrix M contains
the symbol s, such that the codeword for s is the c-th codeword of length r. The
decoding algorithm reads the next ⌈log m⌉ bits into a variable w and finds the
predecessor of w in D. When pred(w, D) is known, we can determine the length
r of the next codeword, and compute its index c as (w − L[r]) ≫ (⌈log m⌉ − r),
where L[r] is taken padded to ⌈log m⌉ bits as in D and ≫ denotes the right bit
shift operation.

The straightforward binary tree solution allows us to find predecessors in
O(log log m) time. We will see in the next section that predecessor queries on a
set of log m elements can be answered in O(1) time. Hence, both encoding and
decoding can be performed in O(1) time in the static scenario. In the adaptive
scenario, we must find a way to maintain the arrays C and L efficiently and, in
the case of decoding, the data structure D.
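
The decoding procedure of this section can be sketched as follows. This is an illustration only: it uses binary search from Python's `bisect` module as the predecessor structure D (the O(log log m) variant, not the constant-time structure of the next section), stores M as a dictionary, counts the index c from 0, and uses the maximum codeword length as the window width instead of ⌈log m⌉. All names are our own.

```python
from bisect import bisect_right


def build_decoder(lengths):
    """Return (width, D, M) for the canonical code with the given lengths.

    D is a sorted list of (padded first codeword of length r, r);
    M maps (r, c) to the symbol whose codeword is the c-th of length r.
    """
    width = max(lengths.values())
    first, M = {}, {}
    code, prev = 0, None
    for sym, r in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if prev is not None:
            code <<= r - prev
        first.setdefault(r, code)
        M[(r, code - first[r])] = sym
        code += 1
        prev = r
    D = sorted((first[r] << (width - r), r) for r in first)
    return width, D, M


def decode(bits, count, width, D, M):
    """Decode `count` symbols from the bit string `bits`."""
    keys = [key for key, _ in D]
    padded = bits + "0" * width           # so the last window is full
    out, pos = [], 0
    for _ in range(count):
        w = int(padded[pos:pos + width], 2)
        key, r = D[bisect_right(keys, w) - 1]   # predecessor of w in D
        c = (w - key) >> (width - r)            # index among length-r codewords
        out.append(M[(r, c)])
        pos += r
    return "".join(out)


decoded = decode("010110111", 4, *build_decoder({"a": 1, "b": 2, "c": 3, "d": 3}))
```

For the code with codewords 0, 10, 110 and 111, the bit string `010110111` decodes back to `abcd`.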
3 Data structures

It is not hard to speed up the method for decoding we described in Section 2.
For example, if the augmented binary search tree we use as D is optimal instead
of balanced then, by Jensen's Inequality, we decode each character in
O(log H) amortized time. Even better, if we use a data structure by Fredman
and Willard [4], then we can decode each character in O(1) worst-case time.

Lemma 1 (Fredman and Willard, 1993). Given O(log^{1/6} m) keys, in O(log^{2/3} m)
worst-case time we can build a data structure that stores those keys and supports
predecessor queries in O(1) worst-case time.

Corollary 1. Given O(log m) keys, in O(log^{3/2} m) worst-case time we can build
a data structure that stores those keys and supports predecessor queries in O(1)
worst-case time.

Proof. We store the keys in the leaves of a search tree with degree O(log^{1/6} m),
size O(log^{5/6} m) and height at most 5. Each node stores an instance of Fredman
and Willard's data structure from Lemma 1: each data structure associated
with a leaf stores O(log^{1/6} m) keys and each data structure associated with an
internal node stores the first key in each of its children's data structures. It is
straightforward to build the search tree in O(log^{2/3 + 5/6} m) = O(log^{3/2} m) time
and implement queries in O(1) time.

In Lemma 1 and Corollary 1 we assume that multiplication and finding the
most significant bit of an integer can be performed in constant time. As shown
in [1], we can implement the data structure of Lemma 1 using AC^0 operations
only. We can restrict the set of elementary operations to bit operations and table
look-ups by increasing the space usage and preprocessing time to O(m^ε). In our
case all keys in the data structure D are bounded by m; hence, we can construct
in O(m^ε) time a look-up table that uses O(m^ε) space and allows us to multiply
two integers or find the most significant bit of an integer, in constant time.

Corollary 1 is useful to us because the data structure it describes not only 
supports predecessor queries in 0(1) worst-case time but can also be built in 
time polylogarithmic in m; the latter property will let our adaptive Shannon 
coder keep its data structures nearly current by regularly rebuilding them. The 
array C and matrix M cannot be built in o{n) time, however, so we combine 
them in a data structure that can be updated incrementally. Arrays C[] and L[], 
and the matrix M defined in the previous section, can be rebuilt as described in 
the next Lemma. 

Lemma 2. If codeword lengths of f ≥ log m symbols are changed, we can rebuild
arrays C[] and L[], and update the matrix M in O(f) time.

Proof. We maintain an array S of doubly-linked lists. The doubly-linked list S[l]
contains all symbols with codeword length l sorted by their codeword indices,
i.e., the codeword of the c-th symbol in the list S[r] is the c-th codeword of length
r. The number of codewords with length l for each 1 ≤ l ≤ ⌈log m⌉ is stored
in the array W[]. If the codeword length of a symbol a is changed from l_1 to
l_2, we replace a with last[l_1] in S[l_1] and append a at the end of S[l_2]. Then,
we set C[k_l] = C[k_a] and C[k_a] = (l_2, W[l_2]), where k_a and k_l are the indices of a
and last[l_1] in the array C[]. Finally, we increment W[l_2], decrement W[l_1] and
update last[l_1] and last[l_2]. We also update entries M[l_1, c_a], M[l_1, W[l_1] − 1] and
M[l_2, W[l_2]] in the matrix M accordingly, where c_a is the index of a's codeword
before the update operation. Thus the array C and the matrix M can be updated
in O(1) time when the codeword length of a symbol is changed.

When codeword lengths of all f symbols are changed, we can compute the
array L from scratch in O(log m) = O(f) time.
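
The constant-time update in the proof of Lemma 2 is essentially a "swap with the last element" step. The sketch below illustrates it under simplifying assumptions of our own: Python lists stand in for the doubly-linked lists, the list entry S[l][c] plays the role of the matrix entry M[l, c], indices are 0-based, and the class and method names are hypothetical.

```python
from collections import defaultdict


class CanonTables:
    """Tables for a canonical code that support O(1) length changes."""

    def __init__(self, lengths):
        self.S = defaultdict(list)   # S[l][c]: symbol with the c-th codeword of length l
        self.C = {}                  # symbol -> (l, c)
        for sym, l in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            self.C[sym] = (l, len(self.S[l]))
            self.S[l].append(sym)

    def change_length(self, a, l2):
        """Move symbol a to codeword length l2 in O(1) time."""
        l1, c = self.C[a]
        last = self.S[l1][-1]
        self.S[l1][c] = last         # last symbol of length l1 takes a's slot
        self.C[last] = (l1, c)
        self.S[l1].pop()
        self.C[a] = (l2, len(self.S[l2]))
        self.S[l2].append(a)         # a becomes the last codeword of length l2

    def first_codewords(self, max_len):
        """Recompute L from the counts W[l] = len(S[l]) in O(max_len) time."""
        L, code = {}, 0
        for l in range(1, max_len + 1):
            L[l] = code
            code = (code + len(self.S[l])) << 1
        return L
```

For example, after `t = CanonTables({"a": 2, "b": 2, "c": 2, "d": 3})` and `t.change_length("a", 3)`, the symbol `c` occupies the slot that `a` held among the codewords of length 2, and `a` is the second codeword of length 3.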

4 Algorithm 

The main idea of the algorithm of [10], which achieves O(1) amortized encoding
cost per symbol, is quantization of probabilities. The Shannon code is main- 
tained for the probabilities pj = [occCa^'ili^ i ] )/g\ wti^re occ(aj, s[l..z]) denotes 
the number of occurrences of the symbol a_j in the string s[1..i] and the
parameter q = O(log m). The symbol must occur q times before the denominator
of the fraction p_i is incremented by 1. Roughly speaking, the value of p_i, and
hence the codeword length of a_i, changes at most once after log m occurrences
of a_i. As shown in Lemma 2, we can rebuild the arrays C[] and L[] in O(log m)
time. Therefore encoding can be implemented in O(1) amortized time. However,
it is not clear how to use this approach to obtain constant worst-case time per
symbol.

In this paper a different approach is used. Symbols s[i + 1], s[i + 2], ..., s[i + d]
are encoded with a Shannon code for the prefix s[1]s[2] ... s[i − d] of the input
string. Recall that in a traditional adaptive code the symbol s[i + 1] is encoded
with a code for s[1] ... s[i]. While symbols s[i + 1] ... s[i + d] are encoded, we
build an optimal code for s[1] ... s[i]. The next group of symbols, i.e.,
s[i + d + 1] ... s[i + 2d], will be encoded with a Shannon code for s[1] ... s[i], and the
code for s[1] ... s[i + d] will be simultaneously rebuilt in the background. Thus
every symbol s[j] is encoded with a Shannon code for the prefix s[1] ... s[j − t],
d ≤ t < 2d, of the input string. That is, when a symbol s[i] is encoded, its
codeword length equals

$$\left\lceil \log \frac{i + n - t}{\max(\mathrm{occ}(s[i], s[1..i-t]), 1)} \right\rceil .$$

We increased the numerator of the fraction by n and the denominator is always
at least 1 because we assume that every character is assigned a codeword of
length ⌈log n − d⌉ before encoding starts. We make this assumption only to
simplify the description of our algorithm. There are other methods of dealing
with characters that occur for the first time in the input string that are more
efficient in practice; see, e.g., [11]. The method of [11] can also be used in our
algorithm, but it would not change the total encoding length.
Later we will show that the delay of at most 2d increases the length of the
encoding only by a lower order term. Now we turn to the description of the
procedure that updates the code, i.e., we will show how the code for s[1] ... s[i]
can be obtained from the code for s[1] ... s[i − d].

Let C be an optimal code for s[1] ... s[i − d] and C' be an optimal code for
s[1] ... s[i]. As shown in Section 2, updating the code is equivalent to updating the
arrays C[] and L[], the matrix M, and the data structure D. Since a group of d
symbols contains at most d different symbols, we must change codeword lengths
of at most d codewords. The list of symbols a_1, ..., a_k such that the codeword
length of a_k must be changed can be constructed in O(d) time. We can construct
an array L[] for the code C' in O(max(d, log m)) time by Lemma 2. The matrix
M and the array C[] can be updated in O(d) time because only O(d) cells of
M are modified. However, we cannot build new versions of M and C[] because
they contain Θ(n log m) and O(n) cells respectively. Since we must obtain the
new version of M while the old version is still used, we modify M so that each
cell of M is allowed to contain two different values, an old one and a new one.
For each cell (r, c) of M we store two values M[r, c].old and M[r, c].new and the
separating value M[r, c].b: when the symbol s[t], t < M[r, c].b, is decoded, we use
M[r, c].old; when the symbol s[t], t ≥ M[r, c].b, is decoded, we use M[r, c].new.
The procedure for updating M works as follows: we visit all cells of M that
were modified when the code C was constructed. For every such cell we set
M[r, c].old = M[r, c].new and M[r, c].b = +∞. Then, we add the new values for
those cells of M that must be modified. For every cell that must be modified,
the new value is stored in M[r, c].new and M[r, c].b is set to i + d. The array
C[] can be updated in the same way. When the array L[] is constructed, we can
construct the data structure D in O(log^{3/2} m) time.
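
The two-valued cells of M can be pictured with a very small data type. The sketch below is illustrative only; the class name `Cell` and its method names are assumptions of ours, while the fields `old`, `new` and `b` follow the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """A cell of M holding an old value, a new value and a switch position b."""
    old: str
    new: str
    b: float = math.inf

    def value(self, t):
        """Value used when the symbol s[t] is decoded."""
        return self.old if t < self.b else self.new

    def commit(self):
        """Forget the old value once the previous update has taken effect."""
        self.old = self.new
        self.b = math.inf

    def schedule(self, new_value, switch_at):
        """Store a new value that becomes visible from position switch_at on."""
        self.new = new_value
        self.b = switch_at
```

For example, after `cell = Cell("x", "x")` and `cell.schedule("y", 10)`, the call `cell.value(9)` still returns `x` while `cell.value(10)` returns `y`; calling `cell.commit()` afterwards makes `y` the only stored value.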

The algorithm described above updates the code if the codeword lengths of
some of the symbols s[i − d + 1] ... s[i] are changed. But if some symbol a does
not occur in the substring s[i − d + 1] ... s[i], its codeword length might still
change in the case when log(i) > log(i_a), where i_a = max{j ≤ i | s[j] = a}. We
can, however, maintain the following invariant on the codeword length l_a:

$$\left\lceil \log \frac{i + n}{\max(\mathrm{occ}(a, s[1..i-2d]), 1)} \right\rceil \le l_a \le \left\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(a, s[1..i-2d]), 1)} \right\rceil \qquad (1)$$

When the codeword length of a symbol a must be modified, we set its length
to $\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(a, s[1..i]), 1)} \rceil$. All symbols a are also stored in the queue Q. When
the code C' is constructed, we extract the first d symbols from Q, check whether
their codeword lengths must be changed, and append those symbols at the end
of Q. Thus the codeword length of each symbol is checked at least once when
an arbitrary substring s[u] ... s[u + n] of the input string s is processed. Clearly,
the invariant (1) is maintained.

Thus the procedure for obtaining the code C' from the code C consists of the
following steps:
1. check symbols s[i − d + 1] ... s[i] and the first d symbols in the queue Q; construct
   a list of codewords whose lengths must be changed; remove the first d symbols
   from Q and append them at the end of Q

2. traverse the list N of modified cells in the matrix M and the array C[], and
   remove the old values from those cells; empty the list N

3. update the array C[] for code C'; simultaneously, update the matrix M and
   construct the list N of modified cells in M and C[]

4. construct the array L and the data structure D for the new code C'

Each of the steps described above, except the last one, can be performed in
O(d) time; the last step can be executed in O(max(d, log^{3/2} m)) time. For d =
⌊log^{3/2} m⌋/2, the code C' can be constructed in O(d) time. If the cost of constructing
C' is evenly distributed among symbols s[i − d], ..., s[i], then we spend O(1)
extra time when each symbol s[j] is processed. Since occ(a, s[1..i − ⌊log^{3/2} m⌋]) ≥
max(occ(a, s[1..i]) − ⌊log^{3/2} m⌋, 1), we need at most

$$\left\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)} \right\rceil$$

bits to encode s[i].

Lemma 3. We can keep an adaptive Shannon code such that, for 1 ≤ i ≤ m,
the codeword for s[i] has length at most

$$\left\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)} \right\rceil$$

and we use O(1) worst-case time to encode and decode each character.

Gagie [6] and Karpinski and Nekrich [10] proved inequalities that, together
with Lemma 3, immediately yield our result.

Theorem 1. We can encode s in at most (H + 1)m + o(m) bits with an adaptive
prefix coding algorithm that encodes and decodes each character in O(1)
worst-case time.

For the sake of completeness, we summarize and prove their inequalities as the
following lemma:

Lemma 4.

$$\sum_{i=1}^{m} \left\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)} \right\rceil \le (H + 1)m + O(n \log^{5/2} m) .$$
Proof. Let

$$L = \sum_{i=1}^{m} \left\lceil \log \frac{i + 2n}{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)} \right\rceil
< \sum_{i=1}^{m} \log(i + 2n) - \sum_{i=1}^{m} \log \max\left(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1\right) + m .$$

Since {occ(s[i], s[1..i]) : 1 ≤ i ≤ m} and {j : 1 ≤ j ≤ occ(a, s), a a character}
are the same multiset,

$$L < \sum_{i=1}^{m} \log(i + 2n) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a,s) - \lfloor \log^{3/2} m \rfloor} \log j + m$$

$$< \sum_{i=1}^{m} \log i + 2n \log(m + 2n) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a,s)} \log j + n \log^{5/2} m + m$$

$$= \log(m!) - \sum_{a} \log(\mathrm{occ}(a, s)!) + m + O(n \log^{5/2} m) .$$

Therefore, by Stirling's Formula,⁴

$$L < m \log m - \frac{m}{\ln 2} - \sum_{a} \left( \mathrm{occ}(a, s) \log \mathrm{occ}(a, s) - \frac{\mathrm{occ}(a, s)}{\ln 2} \right) + m + O(n \log^{5/2} m)$$

$$= \sum_{a} \mathrm{occ}(a, s) \log \frac{m}{\mathrm{occ}(a, s)} + m + O(n \log^{5/2} m)$$

$$= (H + 1)m + O(n \log^{5/2} m) .$$

The second line of the above equality uses the fact that $m = \sum_a \mathrm{occ}(a, s)$.
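
To get a feeling for the quantity bounded in Lemma 4, the sum can be evaluated directly for a concrete string. The sketch below is only an illustration under our own naming: `delay` stands for the term ⌊log^{3/2} m⌋, `n` is the alphabet size, and `entropy_bound` returns the leading term (H + 1)m. Ceilings of logarithms are computed with integer arithmetic to avoid rounding issues.

```python
from collections import Counter
from math import log2


def ceil_log2_ratio(num, den):
    """Smallest k >= 0 with 2**k * den >= num, i.e. ceil(log2(num/den)) for num >= den."""
    k = 0
    while (den << k) < num:
        k += 1
    return k


def delayed_code_length(s, n, delay):
    """Sum over i of ceil(log((i + 2n) / max(occ(s[i], s[1..i]) - delay, 1)))."""
    seen = Counter()
    total = 0
    for i, ch in enumerate(s, start=1):
        seen[ch] += 1
        total += ceil_log2_ratio(i + 2 * n, max(seen[ch] - delay, 1))
    return total


def entropy_bound(s):
    """Leading term (H + 1) * m of the bound in Lemma 4."""
    m = len(s)
    h = sum(c / m * log2(m / c) for c in Counter(s).values())
    return (h + 1) * m
```

For the toy input `"aaaa"` with n = 2, the sum is 8 without delay and 10 with `delay=1`; on such short strings the lower-order term dominates, so the comparison with `entropy_bound` only becomes meaningful for long inputs.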

5 Lower bound 

It is not difficult to show that any prefix coder uses at least (H + 1)m − o(m)
bits in the worst case (e.g., when s consists of m − 1 copies of one character
and 1 copy of another, so Hm < log m + log e). However, this does not rule
out the possibility of an algorithm that always uses, say, at most m/2 more bits
than static Huffman coding. Vitter [18] proved such a bound is unachievable
with an adaptive Huffman coder, and we now extend his result to all adaptive
prefix coders. This implies that for an adaptive prefix coder to have a stronger
worst-case upper bound than ours (except for lower-order terms), that bound
can be in terms of neither the empirical entropy nor the number of bits used by
static Huffman coding.⁵

⁴ Since $\log(m!) - \sum_a \log(\mathrm{occ}(a, s)!) = \log\left(m! / \prod_a \mathrm{occ}(a, s)!\right)$ is the logarithm of the
number of ways to arrange the characters in s, from this point we could also establish
our claim by purely information theoretic arguments.

Theorem 2. Any adaptive prefix coder stores s in at least m − o(m) more bits
in the worst case than static Huffman coding.

Proof. Suppose n = m^{1/2} = 2^ℓ + 1 and the first n characters of s are an
enumeration of the alphabet. For n < i ≤ m, when the adaptive prefix coder reaches
s[i], there are at least two characters assigned codewords of length at least ℓ + 1;
therefore, in the worst case, the coder uses at least (ℓ + 1)m − o(m) bits. On the
other hand, a static prefix coder can assign codewords of length ℓ to the n − 2
most frequent characters and codewords of length ℓ + 1 to the two least frequent
ones, and thus use at most ℓm + o(m) bits. Therefore, since a Huffman code
minimizes the expected codeword length, any adaptive prefix coder uses at least
m − o(m) more bits in the worst case than static Huffman coding.

6 Online stable sorting 

Consider s as a multiset of characters and suppose we want to sort it stably,
online and using only binary comparisons. A stable sort is one that preserves
the order of equal elements, and by 'online' we mean every comparison must
have the character we read most recently as one of its two arguments. Gagie [6]
noted that, by replacing Shannon's construction by a modified construction due
to Gilbert and Moore [8], his coder can be used to sort s using (H + 2)m + o(m)
comparisons and O(log n) worst-case time for each comparison. We can use our
results to speed up Gagie's sorter when n = o(√(m / log m)).

Whereas Shannon's construction assigns a prefix-free binary codeword of
length ⌈log(1/p_i)⌉ to each character with probability p_i > 0, Gilbert and Moore's
construction assigns a codeword of length ⌈log(1/p_i)⌉ + 1. If we take the trie of
the codewords, label the leaves from left to right with the characters in the
alphabet and label each internal node with the label of the rightmost leaf in its
left subtree, the result is a leaf-oriented binary search tree. Building this tree
takes O(n) time because, unlike Shannon's construction, we do not need to sort
the characters by probability. Hence, although we do not know how to update the
alphabetic tree efficiently, we can construct it from scratch in O(n) time. We can
apply the same approach as in previous sections, and use searching with delays:
while we use the optimal alphabetic tree for s[1..i − n/2] to identify symbols
s[i], s[i + 1], ..., s[i + n/2], we construct the tree for s[1..i] in the background. If
we use O(1) time per symbol to construct the next optimal tree, the next tree
will be completed when s[i + n/2] is identified. Hence, we identify each character
using at most

$$\left\lceil \log \frac{i + n}{\max(\mathrm{occ}(s[i], s[1..i]) - n, 1)} \right\rceil + 1$$

comparisons and O(1) worst-case time for each comparison. A detailed
description of the algorithm will be given in the full version of this paper. The following
technical lemma, whose proof we omit because it is essentially the same as that
of Lemma 4, bounds the total number of comparisons we use to sort s and, thus,
implies our speed-up.

⁵ Notice we do not exclude the possibility of natural probabilistic settings in which
our algorithm is suboptimal (e.g., if s is drawn from a memoryless source for which
a Huffman code has smaller redundancy than a Shannon code, then adaptive Huffman
coding almost certainly achieves better asymptotic compression than adaptive
Shannon coding), but in this paper we are interested only in worst-case bounds.
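
Gilbert and Moore's construction mentioned above can be illustrated as follows. The sketch uses the standard formulation in which the codeword of the i-th character consists of the first ⌈log(1/p_i)⌉ + 1 bits of the binary expansion of the midpoint b_i + p_i/2, where b_i is the total probability of the preceding characters; the characters are kept in alphabetical order and are not sorted by probability. The function name and the use of exact fractions are our own choices.

```python
from fractions import Fraction


def gilbert_moore(probs):
    """Alphabetic codewords for probabilities given in alphabetical order.

    `probs` is a list of positive Fractions summing to 1.
    """
    codewords = []
    cumulative = Fraction(0)
    for p in probs:
        length = 1
        while (1 << (length - 1)) * p < 1:     # length = ceil(log2(1/p)) + 1
            length += 1
        midpoint = cumulative + p / 2
        # first `length` bits of the binary expansion of the midpoint
        value = (midpoint.numerator << length) // midpoint.denominator
        codewords.append(format(value, f"0{length}b"))
        cumulative += p
    return codewords


codes = gilbert_moore([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
```

For the probabilities 1/2, 1/4, 1/4 this yields the codewords 01, 101 and 111, which are prefix-free and appear in the same order as the characters.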

Lemma 5.

$$\sum_{i=1}^{m} \left\lceil \log \frac{i + n}{\max(\mathrm{occ}(s[i], s[1..i]) - n, 1)} \right\rceil + m \le (H + 2)m + O(n^2 \log m) .$$



Theorem 3. If n = o(√(m / log m)), then we can sort s stably and online using
(H + 2)m + o(m) binary comparisons and O(1) worst-case time for each
comparison.

We prove the following theorem by essentially the same arguments as for
Theorem 2. It shows that, when n = o(√(m / log m)), our sorter is essentially
optimal. We leave as an open problem finding a sorter that uses (H + 2)m + o(m)
comparisons and O(1) worst-case time per comparison when n is closer to m.



Theorem 4. Any online stable sort uses at least (H + 2)m − o(m) binary
comparisons in the worst case.

Proof. Suppose n = m^{1/2} = 2^{ℓ+1} + 2, s contains only the even-numbered
characters in the alphabet, and the first 2^ℓ + 1 characters of s are an enumeration of
those characters. It is not difficult to show that, for 2^ℓ + 1 < i ≤ m, the online
stable sorter must determine the identity of s[i], in case it is an odd-numbered
character; moreover, it must do this before reading s[i + 1], in case all the
remaining characters in s are smaller than s[i]'s predecessor in {s[j] : 1 ≤ j < i} or
larger than its successor. Therefore, after s[2^ℓ + 1], we can view the sorter as
processing each character using a binary decision tree with n leaves, labelled from
left to right with the characters in the alphabet in lexicographic order. (Since we
are proving a worst-case lower bound, we assume without loss of generality that
the sorter is deterministic.) For 2^ℓ + 1 < i ≤ m, when the sorter reaches s[i], there
must be a leaf at depth at least ℓ + 2 that is labelled with an even-numbered
character in the alphabet. Therefore, even when s contains only the even-numbered
characters in the alphabet, the sorter uses at least (ℓ + 2)m − o(m)
comparisons in the worst case. Since s contains only 2^ℓ + 1 distinct characters, however,
H ≤ log(2^ℓ + 1) and calculation shows (H + 2)m − o(m) ≤ (ℓ + 2)m − o(m). □
7 Other Coding Problems

Several variants of the prefix coding problem have been considered and extensively
studied. In the alphabetic coding problem [8], codewords must be sorted
lexicographically, i.e., i ≤ j implies c(a_i) ≤ c(a_j), where c(a_k) denotes the codeword of a_k.
In the length-limited coding problem, the maximum codeword length is limited
by a parameter F > log n. In the coding with unequal letter costs problem, one
symbol in the code alphabet costs more than another and we want to minimize
the average cost of a codeword. All of the above problems were studied in the
static scenario. Adaptive prefix coding algorithms for those problems were
considered in [5]. In this section we show that good upper bounds on the length
of the encoding can be achieved by algorithms that encode in O(1) worst-case
time. The main idea of our improvements is that we encode a symbol s[i] in the
input string with a code that was constructed for the prefix s[1..i − d] of the input
string, where the parameter d is chosen in such a way that a corresponding
(almost) optimal code can be constructed in O(d) time. Using the same arguments
as in the proof of Lemma 4 we can show that encoding with delays increases the
length of the encoding by an additive term of O(d · n log m) (the analysis is more
complicated in the case of coding with unequal letter costs). We will provide
proofs in the full version of this paper.

Alphabetic Coding. The algorithm of Section 6 can be used for adaptive
alphabetic coding. The length of the encoding is (H + 2)m + o(m) provided that
n = o(√(m / log m)). Unfortunately we cannot use the encoding and decoding
methods of Section 2 because the alphabetic coding is not canonical. When an
alphabetic code for the following group of O(n) symbols is constructed, we also
create in O(n) time a table that stores the codeword of each symbol a_i. Such a
table can be created from the alphabetic tree in O(n) time; hence, the complexity
of the encoding algorithm is not increased. We can decode the next codeword by
searching in the data structure that contains all codewords. Using a data
structure due to Andersson and Thorup [2] we can decode in O(min(√(log n), log log m))
time per symbol.

Theorem 5. There is an algorithm for adaptive alphabetic prefix coding that
encodes and decodes each symbol of a string s in O(1) and O(min(√(log n), log log m))
time respectively. If n = o(√(m / log m)), the encoding length is (H + 2)m + o(m).

Coding with unequal letter costs. Krause [12] showed how to modify
Shannon's construction for the case in which code letters have different costs, e.g.,
the different durations of dots and dashes in Morse code. Consider a binary
channel and suppose cost(0) and cost(1) are constants with 0 < cost(0) ≤ cost(1).
Krause's construction gives a code such that, if a symbol has probability p, then
its codeword has cost less than ln(1/p)/C + cost(1), where the channel capacity
C is the largest real root of $e^{-\mathrm{cost}(0) \cdot x} + e^{-\mathrm{cost}(1) \cdot x} = 1$ and e is the base of the
natural logarithm. It follows that the expected codeword cost in the resulting
code is less than (H ln 2)/C + cost(1), compared to Shannon's bound of (H ln 2)/C. Based on
Krause's construction, Gagie gave an algorithm that produces an encoding of s
with total cost at most ((H ln 2)/C + cost(1)) m + o(m) in O(m log n) time. Since the
code of Krause [12] can be constructed in O(n) time, we can use the encoding
with delay n and achieve O(1) worst-case time. Since the costs are constant and
the minimum probability is Ω(1/m), the maximum codeword length is O(log m).
Therefore, we can decode using the data structure described above.

Theorem 6. There is an algorithm for adaptive prefix coding with unequal letter
costs that encodes and decodes each symbol of a string s in O(1) and
O(min(√(log n), log log m)) time respectively. If n = o(√(m / log m)), the encoding
length is ((H ln 2)/C + cost(1)) m + o(m).

Length-limited coding. Finally, we can design an algorithm for adaptive
length-limited prefix coding by modifying the algorithm of Section 4. Using
the same method as in [5] (i.e., smoothing the distribution by replacing
each probability with a weighted average of itself and 1/n), we set the
codeword length of symbol s[i] to $\lceil \log \frac{2^f}{(2^f - 1)x + 1/n} \rceil$ instead of $\lceil \log \frac{1}{x} \rceil$, where

$$x = \frac{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)}{i + 2n} \quad \text{and} \quad f = F - \log n .$$

We observe that the codeword lengths l_i of this modified code satisfy the
Kraft-McMillan inequality:

$$\sum_{i} 2^{-l_i} \le \sum_{x} \frac{(2^f - 1)x + 1/n}{2^f}
= \sum_{x} \frac{2^f x}{2^f} - \sum_{x} \frac{x}{2^f} + \frac{n (1/n)}{2^f} = 1 .$$

Therefore we can construct and maintain a canonical prefix code with
codeword lengths l_i. Since $\frac{2^f}{(2^f - 1)x + 1/n} \le \min\left(\frac{2^f}{(2^f - 1)x}, n 2^f\right)$, we have
$\lceil \log \frac{2^f}{(2^f - 1)x + 1/n} \rceil \le \min\left(\lceil \log \frac{2^f}{(2^f - 1)x} \rceil, \log n + f\right)$. Thus the codeword length is always at most
F. We can estimate the encoding length by bounding the first part of the above
expression: $\lceil \log \frac{2^f}{(2^f - 1)x} \rceil < \log \frac{1}{x} + 1 + \log \frac{2^f}{2^f - 1}$ and
$\log \frac{2^f}{2^f - 1} = \log\left(1 + \frac{1}{2^f - 1}\right) < \frac{1}{(2^f - 1) \ln 2}$. Summing over all symbols s[i], the total encoding length does not exceed

$$\sum_{i=1}^{m} \log \frac{i + 2n}{\max(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^{3/2} m \rfloor, 1)} + m + \frac{m}{(2^f - 1) \ln 2} .$$

We can estimate the first term in the same way as in Lemma 4; hence, the length
of the encoding is $(H + 1 + \frac{1}{(2^f - 1) \ln 2})m + O(n \log^{5/2} m)$. We thus obtain the following
theorem:

Theorem 7. There is an algorithm for adaptive length-limited prefix coding that
encodes and decodes each symbol of a string s in O(1) time. The encoding length
is $(H + 1 + \frac{1}{(2^f - 1) \ln 2})m + O(n \log^{5/2} m)$, where f = F − log n and F is the maximum
codeword length.
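
The effect of the smoothing step can be checked numerically. The sketch below is an illustration under assumptions of our own: the values x are given directly as exact fractions summing to at most 1 (rather than being derived from a string), n is taken to be the number of values, and the function name is hypothetical. It computes the lengths ⌈log(2^f / ((2^f − 1)x + 1/n))⌉.

```python
from fractions import Fraction


def smoothed_lengths(xs, f):
    """Codeword lengths ceil(log2(2^f / ((2^f - 1) x + 1/n))) for each x in xs."""
    n = len(xs)
    lengths = []
    for x in xs:
        ratio = Fraction(2 ** f) / ((2 ** f - 1) * x + Fraction(1, n))
        k = 0
        while (1 << k) < ratio:      # smallest k with 2^k >= ratio
            k += 1
        lengths.append(k)
    return lengths


xs = [Fraction(61, 64), Fraction(1, 64), Fraction(1, 64), Fraction(1, 64)]
lengths = smoothed_lengths(xs, f=2)
kraft_sum = sum(Fraction(1, 2 ** l) for l in lengths)
```

In this example n = 4 and f = 2, so the length limit is f + log n = 4. The rare symbols, which would receive codewords of length 6 under plain Shannon coding, get codewords of length 4, and the Kraft sum stays below 1.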